In this lab, we will be demonstrating how to tune and fit random
forest models and gradient-boosted tree models (or “boosted trees”).
We’ll demonstrate each of these using two different data sets – a
regression data set (the Wage data from Lab 6) and a
classification data set with multiple levels of the outcome (the
palmerpenguins data, see here).
library(tidymodels)
library(ISLR)
library(ISLR2)
library(tidyverse)
library(glmnet)
library(modeldata)
library(ggthemes)
library(janitor)
library(naniar)
library(xgboost)
library(ranger)
library(vip)
library(corrplot)
library(palmerpenguins) # for the penguins data
tidymodels_prefer()
The Wage data was described fairly thoroughly in Lab 6 –
it consists of wage information and some information on several other
descriptive variables for about 3,000 males in the mid-Atlantic region
of the US. It’s loaded by default when we load the ISLR or
ISLR2 package(s), so we just convert it to a tibble and use
clean_names() to apply the snake case naming convention to
the column names.
wage <- as_tibble(Wage) %>%
clean_names()
You can revisit some of the EDA we did with this data in Lab 6. Here,
we’ll simply proceed to performing an initial split of the data into
training and testing sets, then apply 5-fold cross-validation to the
training set. Note that we make sure to use strata = "wage"
for both of these functions, stratifying on the outcome variable.
set.seed(3435)
wage_split <- initial_split(wage, strata = "wage")
wage_train <- training(wage_split)
wage_test <- testing(wage_split)
wage_folds <- vfold_cv(wage_train, v = 5, strata = "wage")
Artwork by @allison_horst
One of the classic commonly-used data sets for multiclass
classification problems is iris. However, for several good
reasons, there has been discussion in the data science community about
whether continued use of iris is desirable or ethical. (For
those interested, I recommend this blog post: It’s time to
retire the iris data set.)
The palmerpenguins data set was introduced as an
alternative to iris by Allison Horst (who wrote the R
package). It is very similar in structure, in that it contains
observations on 344 penguins from three different species (as visualized
above) – the Chinstrap, Gentoo, and Adelie penguins – who live on three
islands in the Palmer Archipelago, Antarctica. The data set contains
information about the penguins’ bill length and depth in millimeters,
their flipper length in millimeters, their body mass in grams, and their
sex (male or female).
We will work with the palmerpenguins data set. We read
it in here:
penguins <- as_tibble(palmerpenguins::penguins)
We can visualize any missingness:
vis_miss(penguins)
Here, it will probably just be easier to handle missingness by dropping the observations that have any values missing:
penguins <- penguins %>%
drop_na()
In doing so, we only lose about 11 observations. That’s because about
\(3.2\%\) of sex
observations are missing, and \(0.032(344)
\approx 11\). It looks like one or two other penguins are missing
some other values, but those are also missing their
value for sex, so there are only 11 observations dropped
total.
And we’ll do an initial split and 5-fold cross-validation, again
stratifying on the outcome variable, which here is the species of
penguin (or species):
set.seed(3435)
penguin_split <- initial_split(penguins, strata = "species")
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)
penguin_folds <- vfold_cv(penguin_train, v = 5, strata = "species")
Let’s do a little EDA. First of all, how does the outcome variable look – is it balanced (equal counts per class)?
This code also demonstrates how to make a custom color palette by
specifying color hex codes and how to use that palette manually in a
ggplot.
custom_pal <- c("#fb7504", "#c65ccc", "#047374")
penguin_train %>%
ggplot(aes(x = forcats::fct_infreq(species), fill = species,
y = (..count..)/sum(..count..))) +
geom_bar() +
scale_fill_manual(values = custom_pal)+
xlab("Species") +
ylab("Proportion of Training Set") +
theme_minimal() +
theme(legend.position = "none")
Chinstrap penguins are the least common, at only about \(20\%\) of the total, but even so, this doesn’t appear to be a huge imbalance, and we can likely get away without needing to upsample or downsample (although we could certainly try it if we chose).
penguin_train %>%
ggplot(aes(x = island, fill = species)) +
geom_bar(position = "dodge") +
scale_fill_manual(values = custom_pal) +
xlab("Island") +
ylab("Count") +
theme_minimal()
We definitely also might want to include the specific archipelago island as a predictor of penguin species, as shown by the above plot. Although Adelie penguins are found in approximately equal amounts on all three islands, Gentoo penguins were only found on Biscoe and Chinstrap penguins only on Dream.
We can also try visualizing the relationship between flipper length and body mass by species, as shown:
penguin_train %>%
ggplot(aes(x = flipper_length_mm, y = body_mass_g,
color = species)) +
geom_point() +
scale_color_manual(values = custom_pal) +
theme_minimal() +
xlab("Flipper Length (mm)") +
ylab("Body Mass (g)") +
labs(color = "Species")
Describe the relationship(s) between body mass, flipper length, and species in the above picture. Do these predictors seem like they can help differentiate between species?
Make a correlation plot of the numeric variables in
penguins. What relationships do you see?
For the wage data, we set up the same recipe as we did
for a single pruned decision tree in Lab 6 – that is, using all
predictors, but removing region (which is a constant) and
logwage (which is a direct linear function of
wage and therefore would make no sense to include). We
dummy-code any nominal predictors and normalize all predictors:
rec_wage <- recipe(wage ~ ., data = wage_train) %>%
step_rm(region, logwage) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_predictors())
For the penguin data, we remove year, which
just represents the year in which the study was conducted/the penguins
were recorded and likely doesn’t convey much information, and then do
the same steps:
rec_penguin <- recipe(species ~ ., data = penguin_train) %>%
step_rm(year) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_predictors())
We then set up two separate random forest workflows, one for
regression (rf_reg_wf) and one for classification
(rf_class_wf), or one for the wage data and
one for the penguins data, respectively. The default engine
for random forest models in tidymodels is the
ranger package. We flag three hyperparameters for tuning –
mtry, trees, and min_n.
rf_reg_spec <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
rf_reg_wf <- workflow() %>%
add_model(rf_reg_spec) %>%
add_recipe(rec_wage)
rf_class_spec <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger") %>%
set_mode("classification")
rf_class_wf <- workflow() %>%
add_model(rf_class_spec) %>%
add_recipe(rec_penguin)
We set up a grid of hyperparameter values to consider. Here we allow
mtry to range from 1 to 6, trees from 200 to
600, and min_n from 10 to 20, and we specify five levels of
each:
rf_grid <- grid_regular(mtry(range = c(1, 6)),
trees(range = c(200, 600)),
min_n(range = c(10, 20)),
levels = 5)
rf_grid
## # A tibble: 125 × 3
## mtry trees min_n
## <int> <int> <int>
## 1 1 200 10
## 2 2 200 10
## 3 3 200 10
## 4 4 200 10
## 5 6 200 10
## 6 1 300 10
## 7 2 300 10
## 8 3 300 10
## 9 4 300 10
## 10 6 300 10
## # … with 115 more rows
In this code chunk, we actually fit all the random forest models
we’ve specified to each data set. Notice that we have set
eval=FALSE for this code chunk and the following one;
running the models takes a few minutes and is something we want to do as
few times as possible, so we save the results to files and load them
back in later.
tune_reg <- tune_grid(
rf_reg_wf,
resamples = wage_folds,
grid = rf_grid
)
tune_class <- tune_grid(
rf_class_wf,
resamples = penguin_folds,
grid = rf_grid
)
Here we saved the results to two files:
save(tune_reg, file = "tune_reg.rda")
save(tune_class, file = "tune_class.rda")
And finally, in this code chunk, we load the model results back in and take a look at each of them:
load("tune_reg.rda")
load("tune_class.rda")
autoplot(tune_reg) + theme_minimal()
autoplot(tune_class) + theme_minimal()
For the wage data, as we increase mtry, model
performance seems to improve – RMSE tends to decrease and \(R^2\) tends to increase – until we reach
about mtry of 3, at which point it starts to decrease again. The number
of trees, trees, doesn’t make much of a difference overall,
as those lines are virtually on top of each other (and that tends to be
the case). A minimal node size, or min_n, of 20 (the plots
on the far right) seems to produce slightly better results than a
minimum size of 10 (plots on the far left).
We’ll select the optimal random forest model for each data set – in terms of RMSE for the wage data and in terms of ROC AUC for the penguins data:
show_best(tune_reg, n = 1)
## # A tibble: 1 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 4 600 20 rmse standard 33.7 5 1.06 Preprocessor1_Model1…
best_rf_reg <- select_best(tune_reg)
show_best(tune_class, n = 1)
## # A tibble: 1 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 200 10 roc_auc hand_till 1.00 5 0.000152 Preprocessor1_Model…
best_rf_class <- select_best(tune_class)
How many random forest models are we fitting across folds for
each data set (that is, how many are we fitting for the
wage data? How many for the penguins
data?)
Interpret the autoplot() results for the
penguins data.
For both these data sets, which of the three hyperparameters we tuned seems to have the biggest effect on model performance overall?
To fit boosted trees, we will use the same recipe as before, but now
we will set up two separate boosted tree workflows, one
for regression and one for classification. We’ll tune mtry
and trees again; we could also choose to tune
min_n again, but the learning rate tends
to have a much bigger impact on the performance of gradient-boosted
models, and we don’t want to increase the size of the grid too much, so
we’ll add learn_rate instead of min_n.
The default engine for gradient-boosted trees is
xgboost, which stands for “eXtreme Gradient Boosting.”
bt_reg_spec <- boost_tree(mtry = tune(),
trees = tune(),
learn_rate = tune()) %>%
set_engine("xgboost") %>%
set_mode("regression")
bt_reg_wf <- workflow() %>%
add_model(bt_reg_spec) %>%
add_recipe(rec_wage)
bt_class_spec <- boost_tree(mtry = tune(),
trees = tune(),
learn_rate = tune()) %>%
set_engine("xgboost") %>%
set_mode("classification")
bt_class_wf <- workflow() %>%
add_model(bt_class_spec) %>%
add_recipe(rec_penguin)
We set up a grid again, leaving the ranges of mtry and
trees the same as before, but this time specifying the
range for learn_rate. This is the default, which goes from
\(-10\) to \(-1\) in log based ten (the default
transformation; see ?learn_rate), or from \(10^{-10}=1 \times 10^{-10}\) to \(10^{-1}=0.1\).
bt_grid <- grid_regular(mtry(range = c(1, 6)),
trees(range = c(200, 600)),
learn_rate(range = c(-10, -1)),
levels = 5)
bt_grid
## # A tibble: 125 × 3
## mtry trees learn_rate
## <int> <int> <dbl>
## 1 1 200 0.0000000001
## 2 2 200 0.0000000001
## 3 3 200 0.0000000001
## 4 4 200 0.0000000001
## 5 6 200 0.0000000001
## 6 1 300 0.0000000001
## 7 2 300 0.0000000001
## 8 3 300 0.0000000001
## 9 4 300 0.0000000001
## 10 6 300 0.0000000001
## # … with 115 more rows
We now tune all of these models for both data sets.
Note that this code chunk and the following both specify
eval=FALSE, meaning that they will not be run each time the
.Rmd file is knitted – essentially because running them takes so long.
Instead, the results are saved and read back in in a subsequent code
chunk.
tune_bt_reg <- tune_grid(
bt_reg_wf,
resamples = wage_folds,
grid = bt_grid
)
tune_bt_class <- tune_grid(
bt_class_wf,
resamples = penguin_folds,
grid = bt_grid
)
save(tune_bt_reg, file = "tune_bt_reg.rda")
save(tune_bt_class, file = "tune_bt_class.rda")
We read the results in and take a look at the autoplot()
for each set of models:
load("tune_bt_reg.rda")
load("tune_bt_class.rda")
autoplot(tune_bt_reg) + theme_minimal()
autoplot(tune_bt_class) + theme_minimal()
Then, just as before, we’ll select the optimal boosted tree model for each data set in terms of RMSE and ROC AUC, respectively:
show_best(tune_bt_reg, n = 1)
## # A tibble: 1 × 9
## mtry trees learn_rate .metric .estimator mean n std_err .config
## <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 200 0.1 rmse standard 34.1 5 1.02 Preprocessor1_M…
best_bt_reg <- select_best(tune_bt_reg)
show_best(tune_bt_class, n = 1)
## # A tibble: 1 × 9
## mtry trees learn_rate .metric .estimator mean n std_err .config
## <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 300 0.1 roc_auc hand_till 1 5 0 Preprocessor1_M…
best_bt_class <- select_best(tune_bt_class)
Let’s make a table to present the RMSE and ROC AUC values of our best-fitting models (across folds) for each of these two data sets:
| Random Forest | Boosted Tree | |
|---|---|---|
| RMSE | 33.74469 | 34.07487 |
| Random Forest | Boosted Tree | |
|---|---|---|
| ROC AUC | 0.9998485 | 1 |
We’ll go with the best random forest model for the wage data and the best boosted tree model for the penguin data. Here, we fit each of those to the entire training set(s) respectively.
Note that we specify importance = "impurity" in the
workflow setup above; this is so that we can use
extract_fit_parsnip() and vip() to create and
view a variable importance plot.
final_rf_model <- finalize_workflow(rf_reg_wf, best_rf_reg)
final_rf_model <- fit(final_rf_model, wage_train)
final_rf_model %>% extract_fit_parsnip() %>%
vip() +
theme_minimal()
This plot tells us that the three most useful predictors of wage in this random forest model are whether or not someone has an advanced degree, their age, and whether or not they have health insurance.
Let’s use the model to make predictions for the testing data and take a look at its testing RMSE:
final_rf_model_test <- augment(final_rf_model, wage_test)
rmse(final_rf_model_test, truth = wage, .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 34.4
We could also view a scatterplot of the actual wage values in the testing set versus the model-predicted values:
final_rf_model_test %>%
ggplot(aes(x = wage, y = .pred)) +
geom_point(alpha = 0.5) +
theme_minimal()
Now for our boosted tree model. We did not have to include
importance = "impurity" in the workflow for this because
the package xgboost, the default engine, stores variable
importance information by default.
final_bt_model <- finalize_workflow(bt_class_wf, best_bt_class)
final_bt_model <- fit(final_bt_model, penguin_train)
Here is the VIP plot:
final_bt_model %>% extract_fit_parsnip() %>%
vip() +
theme_minimal()
This tells us that the top three most important predictors of species were bill length and depth and flipper length.
Let’s use the model to make predictions for the testing data and take a look at its testing ROC AUC.
Notice that there is a key change here to accommodate the fact that
there are more than two levels of our outcome variable. Rather than
specifying .pred_Yes, for instance, we have to select all
the columns of predicted probabilities – the probability that an
observation is species Adelie, Chinstrap, and Gentoo. We can do that by
typing .pred_Adelie:.pred_Gentoo, which selects not only
the first and last columns but any columns in the middle.
final_bt_model_test <- augment(final_bt_model,
penguin_test) %>%
select(species, starts_with(".pred"))
roc_auc(final_bt_model_test, truth = species, .pred_Adelie:.pred_Gentoo)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc hand_till 1
The ROC AUC on the testing data is basically 1; the model has done a basically perfect job classifying species. Let’s generate the three individual ROC curves:
roc_curve(final_bt_model_test, truth = species, .pred_Adelie:.pred_Gentoo) %>%
autoplot()
And the confusion matrix:
conf_mat(final_bt_model_test, truth = species,
.pred_class) %>%
autoplot(type = "heatmap")
The best-performing boosted tree model has literally done an almost perfect job. It has only misclassified one single penguin in the entire testing set – a penguin that is actually of species Adelie was predicted to be of species Chinstrap.
Interpret the scatterplot of predicted vs. actual values for the wage data. Does our model tend to overpredict or underpredict wage? Do you think excluding those observations with an extremely high wage might affect model performance? Do you think it might make sense to do so? Why or why not?
Do you think using a different threshold for probabilities would result in the penguin model’s correctly predicting every single observation? What threshold value(s) was used in creating the confusion matrix here? How do you know?
The free book Tidy Modeling with R is strongly recommended.